{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "changed-military",
   "metadata": {},
   "source": [
    "# ML 4 Timeseries: Mortality Prediction\n",
    "In this lab you will learn and implement a basic machine learning pipeline for time-series analysis mortality  prediction using ScikitLearn.\n",
    "\n",
    "Lab designed by: Anshul Thakur anshul.thakur@eng.ox.ac.uk<br>\n",
    "For more details on the lab, please see our [github repository](https://github.com/AnshThakur/CDT-TimeSeries)\n",
    "\n",
    "## Your tasks:\n",
    "1. Visualse the distribution of features & an example over time\n",
    "2. Generate statistical summary features\n",
    "    - $\\bar{\\mathbf{x}}$\n",
    "    - $\\bar{\\mathbf{x}}$ $\\pm$ $\\sigma$\n",
    "    - $\\bar{\\mathbf{x}}$ $\\pm$ $\\tilde{\\mathbf{x}}$\n",
    "3. Implement stratified k-fold cross validation\n",
    "4. Apply the necessay pre-processing steps \n",
    "    - normalisation\n",
    "    - outlier removal: mean imputation\n",
    "5. Train two off-the-shelf machine learning models for mortality prediction. \n",
    "    - Train a linear model, e.g. Logistic Regression\n",
    "    - Train a non-linear model, e.g. Random Forest\n",
    "6. Evaluate the model\n",
    "    - Compare between model perfromance\n",
    "    - Determine the important features for mortality prediction\n",
    "    - Compare model performance between various types of features\n",
    "7. Re-implement time-series specific features\n",
    "    - $\\mathbf{a}_{l}$, the autocorrelation coefficients at various time lags, $l$\n",
    "    - feel free to come up with your own, for example windowing the time-series\n",
    "    - train your models again and compare model performance between statistcial and time-series feature types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "boolean-lightweight",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import os\n",
    "import math\n",
    "import warnings\n",
    "import itertools\n",
    "import numbers\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.metrics import roc_curve, auc, roc_auc_score\n",
    "\n",
    "import os\n",
    "from os.path import expanduser\n",
    "home = expanduser(\"~\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "established-waters",
   "metadata": {},
   "source": [
    "## Load the data:\n",
    "The data (``X_data.npy``) is is stored as $\\mathbf{X} \\in \\mathbb{R}^{N\\times T \\times P}$, where $N$ are the number of patients, $T$ are the number of time-steps and $P$ are the number of features. A corresponding column vector of labels (``y_data.npy``), $\\mathbf{y} \\in \\mathbb{R}^{N\\times 1}$, denote patient mortality at time $T+1$ A list of the clinical measurement names (i.e. the features) are stored in ``feature_names.txt``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "devoted-stuart",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''load the data'''\n",
    "X=np.load('./data/X_data.npy')\n",
    "y=np.load('./data/y_data.npy')\n",
    "feature_names=pd.read_csv('./data/feature_names.txt').to_numpy().reshape(-1)\n",
    "\n",
    "print('X shape (N x T x P):', X.shape)\n",
    "print('y shape:', y.shape)\n",
    "print('number of features', len(feature_names))\n",
    "print('features names:',feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "respiratory-general",
   "metadata": {},
   "source": [
    "We can see that the features are various physiological measurements, such as heart rate, blood pressure and respiratory rates. Other measurements include the patient's demographics, such as sex, or physical attributes such as weght and height. A clinican-rated scoring system is also recorded, known as the Glasgow Coma Scale (GCS). These consist questions from three domains, (1) Eye response, (2) Verbal response (3) Motor response. These features and sub-domains have been coded as binary response or categorical variables. For more information see\n",
    "https://www.glasgowcomascale.org/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "designing-cooper",
   "metadata": {},
   "source": [
    "## 1. Task: Visualse the distribution of features & an example over time\n",
    "- <b>Hint</b>: summerise the features over time first using the median or mean\n",
    "- use a for loop\n",
    "- take an example subject for visualising over time"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "settled-platinum",
   "metadata": {},
   "source": [
    "## 2. Task: Generate summary features & time-series specific features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "level-internet",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''feature extraction'''\n",
    "#the mean\n",
    "X_m=np.mean(X, axis=1)\n",
    "\n",
    "#the variance\n",
    "X_sd=np.std(X, axis=1)\n",
    "\n",
    "#the delta\n",
    "X_delta=#...calculate the delta/slope of the features\n",
    "X_delta[np.isnan(X_delta)]=0 #remove nan values\n",
    "\n",
    "#(1) mean feature value\n",
    "X_data_1=X_m\n",
    "feature_names_1=feature_names\n",
    "print('new fetaure matrix shape (mean):\\nX data', X_data_1.shape)\n",
    "\n",
    "'''hint: concatonate the features together into a new feature matrix and create a new vector with the feature names (i.e. the mean, the variance, etc.)'''\n",
    "#(2) mean feature value + variability of feature\n",
    "X_data_2=np.concatenate((X_m, X_sd), axis=1)\n",
    "feature_names_2=np.concatenate((feature_names + '(mean)', feature_names + '(std)'), axis=0)\n",
    "print('new fetaure matrix shape (mean + sd):\\nX data', X_data_2.shape)\n",
    "\n",
    "#(3) mean feature value + slope of feature\n",
    "X_data_3=np.concatenate((X_m, X_delta), axis=1)\n",
    "feature_names_3=np.concatenate((feature_names + '(mean)', feature_names + '(delta)'), axis=0)\n",
    "print('new fetaure matrix shape (mean + slope):\\nX data', X_data_3.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "animated-florence",
   "metadata": {},
   "source": [
    "## Task: Implement stratified k-fold cross validation\n",
    "Split the data into training and validation, test sets\n",
    "- <b>Hint</b>, get the pipleine running with one split only, they merge your pipeline into a loop\n",
    "- the code below demonstrates how to ranomly split the data (non k-fold)\n",
    "- Either manually create a k-fold validation split, or use ScikitLearn's functions:\n",
    "https://scikit-learn.org/stable/modules/cross_validation.html#stratification\n",
    "- <b>Hint</b>: make sure the distributions are stratifed between train, validation and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "thrown-pepper",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''example of making one split, using mean features'''\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_data_1, y, test_size=0.33, random_state=42, stratify=y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adopted-diving",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''helper function to calculate class distributions'''\n",
    "import collections\n",
    "def calculate_class_distributions(labels, numeric_labels, response):\n",
    "    \n",
    "    counter=collections.Counter(sorted(response))\n",
    "    counts=np.array(list(counter.values()))\n",
    "    keys=np.array(list(counter.keys()))\n",
    "\n",
    "    print('Distribution : n={:6.0f}'.format(len(response)))\n",
    "    for ix, label in enumerate(labels):\n",
    "        print('{}: {:10s}: n={:6.0f} ({:3.2f}%)'.format(keys[ix], label, counts[ix], counts[ix]/len(response)*100))\n",
    "\n",
    "    return counter\n",
    "\n",
    "print('Training')\n",
    "calculate_class_distributions(labels=['alive','mortality'], numeric_labels=[0, 1], response=y_train);\n",
    "print('Testing')\n",
    "calculate_class_distributions(labels=['alive','mortality'], numeric_labels=[0, 1], response=y_test);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "constant-bidder",
   "metadata": {},
   "source": [
    "We have uneven distributions of alive after timestep $T+1$ versus a mortality, we therefore have to consider modeling this as an imbalanced data problem;\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "trying-anger",
   "metadata": {},
   "source": [
    "## 3. Task: Apply the necessay pre-processing steps \n",
    "    - normalisation\n",
    "    - outlier removal: mean imputation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "printable-connecticut",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''task: normalisation'''\n",
    "def zscore_data(x):\n",
    "    '''...write a function to standardise the data using the zscore'''\n",
    "    return x \n",
    "\n",
    "X_test=zscore_data(X_test)\n",
    "X_train=zscore_data(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hazardous-eight",
   "metadata": {},
   "source": [
    "Perform outlier removal using the following cases:\n",
    "$$\n",
    "x_i=\\begin{cases}\n",
    "\t\t\t\\bar{\\mathbf{x}}, & \\text{if $x_i$ > $\\bar{\\mathbf{x}} \\pm (\\alpha\\times\\sigma)$}\\\\\n",
    "            x_i, & \\text{otherwise}\n",
    "\t\t \\end{cases}, \\forall \\ \\mathbf{x} \\in \\mathbf{X}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "engaging-composite",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''task: outlier removal'''\n",
    "# - impute as the mean value if the feature value is > mean + threshold*std\n",
    "def remove_outliers(X, sd_threshold=5):\n",
    "    '''...write a function to remove the outliers in the data'''\n",
    "    return X\n",
    "\n",
    "X_train=remove_outliers(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "nearby-effectiveness",
   "metadata": {},
   "source": [
    "## Task: Train and compare two off-the-shelf machine learning models (linear model)\n",
    "In this section we will now train our off-the-shelf machine leanring (ML) model. A great starting point is to use a simple linear model, such as logistic regression, which essentially describes a linear regression with a (non-linear) logistic output to perform classification. \n",
    "\n",
    "- You are free to use any machine learning model you wish, or compare between models\n",
    "\n",
    "### Background\n",
    "\n",
    "#### Linear Regression\n",
    "A linear regression model explicitly describes a relationship between predictor(s) $\\mathbf{X} \\in \\mathbb{R}^{N\\times P}$ and continuous response variables $\\mathbf{y} \\in \\mathbb{R}^{N}$. For an $i^{th}$ observation row of $\\mathbf{X}$, $\\mathbf{x}\\equiv\\mathbf{x_i}\\in\\mathbb{R}^{1\\times P}$:\n",
    "\\begin{align}\n",
    "        \\hat{y} &=w_0 + w_1x_1 + w_2x_2+\\dotsc+w_Px_P + \\epsilon\\\\\n",
    "        &=w_0 + \\sum_{j=1}^Pw_jx_j + \\epsilon\\\\\n",
    "        &=\\mathbf{w}^\\top\\mathbf{x} + b \\\\\n",
    "        &=\\mathbf{w}^\\top\\mathbf{x}\n",
    "\\end{align}\n",
    "where $w_j$ values denote the slope (weights, or regression coefficients) of the $x_j$ features; $w_0$ is the intercept term; and $\\epsilon$ denote the residual (model) errors term, which are assumed to be normally distributed with constant variance, $\\epsilon\\sim\\mathcal{N}(0,\\sigma^2)$ [1,2-3]. Often a linear model is described in vector notation, $\\mathbf{w}=[w_1, w_2, ..., w_P]$, where $w_0$ is denoted as the <em>bias</em>, $b$ term, and the $\\epsilon$-term is often omitted. More succinctly, a linear model can be also represented by the equation $\\mathbf{w}^\\top\\mathbf{x}$, where $\\mathbf{w}=[w_0, w_1,w_2,...,w_{P+1}]$ is a vector of regression coefficients, including $w_0$ (or $b$) as the first value and $\\mathbf{x}\\in\\mathbb{R}^{1\\times(P+1)}$ which is first concatenated with a vector column of ones to account for $w_0$ (or $b$) in $\\mathbf{w}$ at the first index. \n",
    "\n",
    "#### Logistic Regression\n",
    " Generalised linear models (GLMs) are extensions of linear regression models that can have non-linear outputs [2]. GLMs utilise canonical link functions, $\\phi$, to transform the outputs of a linear regression:  $\\varphi=\\mathbf{w}^\\top\\mathbf{x}$ to another distribution, such as with a logistic $\\phi = \\sigma(\\varphi)$ link function (or inversely the logit, representing the log-odds) which will be used to form Logistic Regression for binary classification tasks [1,2]:\n",
    " \\begin{equation}\n",
    "     \\sigma(x)=\\frac{1}{1+e^{-x}}\n",
    " \\end{equation}\n",
    " in this case $\\phi$ is sigmoidal and is bounded between $[0,1]$, therefore the output of $\\sigma$ can be interpreted as the probability of $y=1$: \n",
    " \\begin{equation}\n",
    "    p(\\mathbf{x}; \\mathbf{w})=\\frac{1}{1+e^{-(\\mathbf{w^\\top}\\mathbf{x})}}\n",
    "\\end{equation}\n",
    "A threshold can be applied to the probabilistic output $p$ to determine a classification prediction $\\hat{y}$ for a Logistic Regression model; threshold values are typically chosen as 0.5, but this can be altered based on the use case.  \n",
    "\n",
    "#### Regularisation $\\rightarrow$ Feature Selection\n",
    "Many statistical and machine learning models can easily overfit to the training data, resulting in poorer estimations, models that are not generalisable or too complex. Regularisation is often introduced to mitigate against this. For example large coefficient values in a regression can be penalised by adding a regularisation term to a loss function, or through reducing the number of parameters or features used in a model. The most common regularisers use the $\\ell_p$-norm defined by [1,5]:\n",
    "\n",
    "$$\n",
    "   ||\\mathbf{x}||_p=\\left(\\sum_{i=1}^N|x_i|^p\\right)^{1/p}\n",
    "$$ \n",
    "for any $\\mathbf{x}\\in\\mathbb{R}^{N\\times1}$, where the real number $p\\geq1$ defines the $\\ell_p$ space.\n",
    "\n",
    "#### LASSO for Logistic Regression\n",
    "The Least Absolute Shrinkage and Selection Operator (LASSO) [5-6] is a technique that conversely solves the  $\\ell_1\\mathrm{-penalised}$ sum of squares in a linear regression such that:\n",
    "$$\n",
    "    \\hat{\\mathbf{w}} =\\underset{\\mathbf{w}}{\\operatorname{argmax}}\\left\\{\\sum_{i=1}^N|y_i-w_0\\sum_{j=1}^{P}w_{j}x_{ij}|^2 + \\lambda\\sum_{j=1}^P|w_j|\\right\\}\n",
    "$$\n",
    "This is equivalent to minimising the sum of squares with a constraint of the form: $||w||_1 = \\sum_{j}^N|w_j|\\leq t$. \n",
    "Because of the form of the $\\ell_1$-penalty, LASSO both shrinks coefficients but also encourages sparsity in a model's parameters and thus inherently forms feature selection, shrinking non-important features to zero. \n",
    "\n",
    "The LASSO can also be extended to perform feature selection for classification by substituting a canonical link function (such as the logistic $\\phi=\\sigma(\\varphi)$ and following the same procedure, essentially performing regularised-logistic regression [3]).\n",
    "\n",
    "LASSO regularisation for logistic regression can therefore be employed in order to reduce the dimensions of the extracted feature space into a ranked parsimonious set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "offensive-moderator",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''run logistic regression with l1 regularisation (i.e. LASSO)'''\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "#LAMBDA=....set and play around with this value\n",
    "\n",
    "clf_l1_LR=LogisticRegression(random_state=42, penalty='l1', C=LAMBDA, solver=\"saga\", tol=0.001, class_weight='balanced')\n",
    "clf_l1_LR.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "military-associate",
   "metadata": {},
   "source": [
    "<b>hint</b> use the ``class_weight='balanced'`` parameter to account for imbalanced data in the loss fucntion. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fourth-vietnamese",
   "metadata": {},
   "source": [
    "## Ancillary task: \n",
    "- optimse over $\\lambda$ values (i.e. parameter C), using internal cross-validation using training and validation sets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "solar-departure",
   "metadata": {},
   "source": [
    "## Task: Train and compare two off-the-shelf machine learning models (non-linear model)\n",
    "\n",
    "### Random Forest\n",
    "Classification and Regression Trees (CART) specifically Random Forests (RF) are a multi-functional, non-linear method capable of performing regression, classification and feature selection [1,4]. Unlike the linear filter-based methods of feature selection introduced thus far, for example, LASSO (which are also often used to (linearly) pre-select features prior to application in (non-linear) methods, such as SVMs), RFs incorporate non-linear feature selection as part of the model methodology. \n",
    "\n",
    "*Below is a simple diagram describing the construction of a Random Forest (RF) model.*\n",
    "  ![RF_schematic.png](./RF_schematic.png)\n",
    "\n",
    "Random Forests consist of a large ensemble of decision trees arranged in a hierarchical structure, as depicted in the figure above. To build an individual tree, we recursively descend through the hierarchy, performing binary splits (decisions) at each level in the structure (a node, $j$) using a single feature $x_j\\in\\mathcal{X}^p$ based on a threshold value (splitting criterion) $s_j$, sub-partitioning the feature space $\\mathcal{X}_j$ at each node. A tree is typically expanded until all leaves are pure (i.e each partition $\\mathcal{X}_j$ represents only one class) or until all leaves contain less than the minimum number of samples in a partition $\\mathcal{X}_j$ required to split a node. While decision trees are susceptible to over-fitting, the advantage of a RF is that multiple trees can be learned, introducing a variability between the trees. An individual tree selects $m<N$ random subset of observations (with replacement), and each node considers a random subset of $p$ features for each split. Importantly, the same feature can be selected for multiple nodes in the tree, and can have different associated $s_j$ values at each node. Once all trees have been grown, each of the <em>weaker</em> decisions are aggregated (or \\textit{ensembled}) creating a robust final prediction. \n",
    "Classification predictions are deduced as the majority class label of the observations present in each final partition $\\mathcal{X}_j$, whereas for continuous prediction (i.e. regression) the mean of the (continuous) responses would be calculated instead. To determine the optimal split criterion $s_j$ for each $x_j$ to create each $\\mathcal{X}_j$ we evaluate the \\textit{Gini} importance, which quantifies the average gain of purity (i.e. the presence of one class) caused by splits of a given variable.\n",
    "RFs have relatively little hyperparamater tuning, which typically only include the number of trees to build ($k=1500$ in this thesis) and the number of input variables chosen at each node ($p$). Values of $p$ are suggested in [8] are: $p \\in \\{\\sqrt{P}, 2\\sqrt{P}, \\sqrt{P}/2 \\}$.\n",
    "\n",
    "### Balanced Random Forest\n",
    "In this example we utalise a balanced Random Forest: `BalancedRandomForestClassifier`, which randomly under-samples each boostrap sample to balance the classes during training. For more details see: https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedRandomForestClassifier.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "binding-application",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''balanced Random Forest'''\n",
    "from imblearn.ensemble import BalancedRandomForestClassifier\n",
    "#mfeats=...students should set this value\n",
    "\n",
    "clf_RF = BalancedRandomForestClassifier(\n",
    "    max_features=mfeats,\n",
    "    n_estimators=2000,\n",
    "    replacement=True,\n",
    "    sampling_strategy='not minority',\n",
    "    oob_score=True,\n",
    "    n_jobs=4,\n",
    "    random_state=42,\n",
    "    verbose=1\n",
    ")\n",
    "clf_RF.fit(X_train, y_train)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ranging-sheffield",
   "metadata": {},
   "source": [
    "## Task: Evaluate the model(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "signal-eating",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''helper functions to compute evluation metrics'''\n",
    "import sklearn.metrics as metrics\n",
    "\n",
    "def get_one_hot(targets, nb_classes):\n",
    "    '''numpy version'''\n",
    "    res = np.eye(nb_classes)[np.array(targets).reshape(-1)]\n",
    "    return res.reshape(list(targets.shape)+[nb_classes])\n",
    "\n",
    "def compute_scores(y_true, y_pred, y_prob=None):\n",
    "    ''' Compute a bunch of scoring functions '''\n",
    "    auc=np.nan\n",
    "    confusion = metrics.confusion_matrix(y_true, y_pred)\n",
    "    per_class_recall = metrics.recall_score(y_true, y_pred, average=None)\n",
    "    accuracy = metrics.accuracy_score(y_true, y_pred)\n",
    "    f1=metrics.f1_score(y_true, y_pred, pos_label=1)\n",
    "    balanced_acuracy = metrics.balanced_accuracy_score(y_true, y_pred)\n",
    "    kappa = metrics.cohen_kappa_score(y_true, y_pred)\n",
    "    if y_prob is not None:\n",
    "        auc=metrics.roc_auc_score(get_one_hot(y_true,  y_prob.shape[1]), y_prob)\n",
    "    return {\n",
    "        'confusion': confusion,\n",
    "        'per_class_recall': per_class_recall,\n",
    "        'accuracy': accuracy,\n",
    "        'balanced_accuracy': balanced_acuracy,\n",
    "        'kappa': kappa,\n",
    "        'F1': f1, \n",
    "        'AUROC': auc,\n",
    "    }\n",
    "\n",
    "def print_scores(scores):\n",
    "    print('Accuracy: {:.3f} | Balanced Accuracy: {:.3f} | AUROC: {:.3f} | Kappa: {:.3f} | F1: {:.3f}'.format(scores['accuracy'], scores['balanced_accuracy'], scores['AUROC'], scores['kappa'], scores['F1']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "collect-juice",
   "metadata": {},
   "source": [
    "#### (1) Evaluate LASSO-Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "miniature-crowd",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''evalute the logistic regression model'''\n",
    "y_train_pred = clf_l1_LR.predict(X_train)\n",
    "y_test_pred = clf_l1_LR.predict(X_test)\n",
    "y_test_prob = clf_l1_LR.predict_proba(X_test) \n",
    "\n",
    "print('Logstic Regression')\n",
    "print_scores(compute_scores(y_test, y_test_pred, y_test_prob))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "controlling-bullet",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
    "\n",
    "labelstr=['alive', 'mortality']\n",
    "CM=confusion_matrix(y_test,y_test_pred, )\n",
    "nCM=CM/sum(CM[:])\n",
    "\n",
    "disp = ConfusionMatrixDisplay(confusion_matrix=CM,\n",
    "                    display_labels=labelstr)\n",
    "print('confusion matrix')\n",
    "disp.plot();plt.show()\n",
    "\n",
    "ndisp = ConfusionMatrixDisplay(confusion_matrix=nCM,\n",
    "                    display_labels=labelstr)\n",
    "print('normalised confusion matrix')\n",
    "ndisp.plot(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "artificial-belief",
   "metadata": {},
   "source": [
    "#### (2) Evaluate Balanced Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "consistent-given",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train_pred = clf_RF.predict(X_train)\n",
    "y_test_pred = clf_RF.predict(X_test)\n",
    "y_test_prob = clf_RF.predict_proba(X_test) \n",
    "\n",
    "print('Random Forest')\n",
    "print_scores(compute_scores(y_test, y_test_pred, y_test_prob))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "clear-heavy",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
    "\n",
    "labelstr=['alive', 'mortality']\n",
    "CM=confusion_matrix(y_test,y_test_pred, )\n",
    "nCM=CM/sum(CM[:])\n",
    "\n",
    "disp = ConfusionMatrixDisplay(confusion_matrix=CM,\n",
    "                    display_labels=labelstr)\n",
    "print('confusion matrix')\n",
    "disp.plot();plt.show()\n",
    "\n",
    "ndisp = ConfusionMatrixDisplay(confusion_matrix=nCM,\n",
    "                    display_labels=labelstr)\n",
    "print('normalised confusion matrix')\n",
    "ndisp.plot(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "animated-retail",
   "metadata": {},
   "source": [
    "## Task: Determine the important features for mortality prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "conditional-andorra",
   "metadata": {},
   "source": [
    "### (1) Feature Importance: LASSO-Logistic Regression\n",
    "- students should create a feature ranking table from best feature to worst, ranked by coefficent value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "temporal-hierarchy",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''determine the remaning number of important features'''\n",
    "coef_l1_LR = clf_l1_LR.coef_.ravel()\n",
    "sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100\n",
    "print('Total number of features: {:}\\nNumber of features retained: {:}\\nNumber of features removed: {:}\\nSparsitiy introduced: {:.2f}%'. format(len(coef_l1_LR), np.sum(coef_l1_LR != 0), np.sum(coef_l1_LR == 0), sparsity_l1_LR))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "imposed-contents",
   "metadata": {},
   "source": [
    "### (2) Feature Importance: Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "immune-necklace",
   "metadata": {},
   "outputs": [],
   "source": [
    "tree_importance_sorted_idx = np.argsort(clf_RF.feature_importances_)\n",
    "feature_importance=clf_RF.feature_importances_[tree_importance_sorted_idx]\n",
    "tree_indices = np.arange(0, len(feature_importance)) + 0.5\n",
    "\n",
    "#take the first 25 features\n",
    "nfeats=25\n",
    "tree_importance_sorted_idx=tree_importance_sorted_idx[-nfeats:]\n",
    "feature_importance=feature_importance[-nfeats:]\n",
    "tree_indices = np.arange(0, nfeats) + 0.5\n",
    "\n",
    "\n",
    "fig, ax = plt.subplots(1, 1, figsize=(12, 8))\n",
    "ax.barh(tree_indices,\n",
    "         feature_importance, height=0.7)\n",
    "ax.set_ylim((0, len(tree_indices)))\n",
    "ax.set_yticks(tree_indices)\n",
    "ax.set_yticklabels(feature_names_1[tree_importance_sorted_idx])\n",
    "ax.set_xlabel('feature importance')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "desirable-advisory",
   "metadata": {},
   "source": [
    "# Additonal Task: Re-implement time-series specific features \n",
    "(This is an additonal task if you have time. We'll learn more about autoregressive models later in the course.)\n",
    "For this example, we will implement use the assumptions of autoregressive models (AR) to evaluate time-series specific features. We can estimate the relationship between the value at any point $t$, $x_t$, and the value at any point $(t-1)$, $x_{t-1}$ using:\n",
    "\n",
    "\\begin{equation}\n",
    "    AR(1)=x_t\\approx a_1x_{t-1}+c\n",
    "\\end{equation}\n",
    "\n",
    "where $a_1$ are the weights of the model and c is the constant. $AR(1)$ denotes that this is a first order model.\n",
    "\n",
    "\n",
    "We can compute the autocorrelation within the time-series $\\mathbf{x}\\in\\mathbb{R}^{1\\times N}$ [9, 10]. The autocorrelation function measures the correlation between $x_t$ and $x_{t+k}$, at various lags, $k$, where $k = 0,\\ldots,K$. The autocorrelation for lag $k$ is: \n",
    "\\begin{equation}a_k=\\frac{c_k}{c_0} \\end{equation} \n",
    "\n",
    "\\begin{equation} c_k=\\frac{1}{T}\\sum\\limits_{t=1}^{T-k}(x_t - \\mathbf{\\bar{x}})(x_{t+k}-\\mathbf{\\bar{x}}) \\end{equation}\n",
    "where $\\mathbf{\\bar{x}}$ is the mean of $\\mathbf{x}$; $c_0$ is the sample variance of the time-series.\n",
    "\n",
    "\n",
    "In this work we can extract the auto-correlation of some physiological measurement at various lags, $k$, in order to extract information about stochasticity. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "quick-singapore",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''Task: First pull out the continous feature values, i.e. the physiological measurements such as heart rate and respiratoration. '''\n",
    "#fidx=...the index of the features you want to pull\n",
    "X_=X[:, :, fidx]\n",
    "print('X data (shape)', X_.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "static-receipt",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''define our autocorrelation function (ACF)'''\n",
    "#see https://stackoverflow.com/questions/643699/how-can-i-use-numpy-correlate-to-do-autocorrelation \n",
    "def autocorrelation(x,nlags=15):\n",
    "    '''normalised, full autocorrelation'''\n",
    "    lags=np.arange(nlags)\n",
    "    mean=x.mean()\n",
    "    var=np.var(x)\n",
    "    xp=x-mean\n",
    "    corr=np.correlate(xp,xp,'full')[len(x)-1:]/var/len(x)\n",
    "\n",
    "    return corr[:len(lags)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "continuous-replica",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''plot the autocorrelation for varying time lags for an example subject and feature'''\n",
    "#example_index=...the index of the subject you want to examine\n",
    "\n",
    "xx=X[example_index, :, feature_names=='...insert feature name...'].squeeze()\n",
    "nlags=np.floor(len(xx)).astype(int)\n",
    "lags=np.arange(nlags)\n",
    "\n",
    "r=autocorrelation(xx, nlags=nlags)\n",
    "plt.stem(r)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "brutal-subsection",
   "metadata": {},
   "source": [
    "We will be able the autocorrelation of the heart rate for varying time lags. Below we can use the statsmodels.py plots to generate an ACF with confidence limits. See https://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_acf.html for more details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "liked-verification",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''plot the autocorrelation for varying time lags'''\n",
    "from statsmodels.graphics.tsaplots import plot_acf\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))\n",
    "\n",
    "ax1.plot(xx)\n",
    "\n",
    "ax1.set_ylabel('feature')\n",
    "ax1.set_xlabel('time')\n",
    "\n",
    "# Use the Autocorrelation function\n",
    "# from the statsmodel library passing\n",
    "plot_acf(x=xx,  lags=lags, ax=ax2)\n",
    "\n",
    "ax2.set_ylabel('r')\n",
    "ax2.set_xlabel('lag')\n",
    "ax2.set_xticks(lags)\n",
    "\n",
    "fig.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "interim-progressive",
   "metadata": {},
   "source": [
    "Next we can extract autocorrelation features over entire dataset of continuous feature values, i.e. the physiological measurements such as heart rate and resp. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dried-brazilian",
   "metadata": {},
   "outputs": [],
   "source": [
    "'''extract autocorrelation features over the data'''\n",
    "def extract_autocorr_features(data, nlags):\n",
    "    '''function to extract autocorrelation features on a data matrix'''\n",
    "    rcorr = np.zeros((data.shape[0], nlags, data.shape[2]))\n",
    "    for i, x in enumerate(data):\n",
    "        x=np.transpose(x)\n",
    "        for j, xx in enumerate(x):\n",
    "            rcorr[i, :, j]=autocorrelation(xx, nlags=nlags)       \n",
    "\n",
    "    rcorr[np.isnan(rcorr)]=0        \n",
    "\n",
    "    return rcorr\n",
    "\n",
    "nlags=np.floor(X_.shape[1]/2).astype(int)\n",
    "XAC=extract_autocorr_features(X_, nlags=48)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "retained-wealth",
   "metadata": {},
   "source": [
    "We can then combine these time-series specific features with the rest of the categorical variables.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "every-concentrate",
   "metadata": {},
   "outputs": [],
   "source": [
    "#combine these time-series specific features with the rest of the categorical variables\n",
    "XAC=XAC.reshape(-1, XAC.shape[1]*XAC.shape[2])\n",
    "X_data_new=np.concatenate((X_data_1[:, ~fidx], XAC), axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "public-image",
   "metadata": {},
   "source": [
    "## Task: train your models again and compare model performance between statistcial and time-series feature types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dangerous-median",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_data_new, y, test_size=0.33, random_state=42, stratify=y)\n",
    "\n",
    "X_test=zscore_data(X_test)\n",
    "X_train=zscore_data(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "diagnostic-wrong",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "clf_l1_LR=LogisticRegression(random_state=42, penalty='l1', C=LAMBDA, solver=\"saga\", tol=0.001, class_weight='balanced')\n",
    "clf_l1_LR.fit(X_train, y_train)\n",
    "\n",
    "y_train_pred = clf_l1_LR.predict(X_train)\n",
    "y_test_pred = clf_l1_LR.predict(X_test)\n",
    "y_test_prob = clf_l1_LR.predict_proba(X_test) \n",
    "\n",
    "print('Logstic Regression')\n",
    "print_scores(compute_scores(y_test, y_test_pred, y_test_prob))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "split-exploration",
   "metadata": {},
   "outputs": [],
   "source": [
    "from imblearn.ensemble import BalancedRandomForestClassifier\n",
    "\n",
    "clf_RF = BalancedRandomForestClassifier(\n",
    "    n_estimators=2000,\n",
    "    replacement=True,\n",
    "    sampling_strategy='not minority',\n",
    "    oob_score=True,\n",
    "    n_jobs=4,\n",
    "    random_state=42,\n",
    "    verbose=1\n",
    ")\n",
    "clf_RF.fit(X_train, y_train)\n",
    "\n",
    "y_train_pred = clf_RF.predict(X_train)\n",
    "y_test_pred = clf_RF.predict(X_test)\n",
    "y_test_prob = clf_RF.predict_proba(X_test) \n",
    "\n",
    "print('Random Forest')\n",
    "print_scores(compute_scores(y_test, y_test_pred, y_test_prob))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "restricted-uruguay",
   "metadata": {},
   "source": [
    "We should be able to get a modest improvement in performance with the addition of these new features. Whilst the ACF allows us to capture the temporal properties of a signal and it's stochasticity, we are limited by how our features can capture this information. These approaches, however, are constrained transformations and approximations of ambulatory function which are based on prior assumptions. Hand-crafted gait features are often established signal-processing or statistical metrics re-purposed as surrogates to represent timporal aspects of a signal; for instance, extracting the variance in a sensor signal in an attempt to variability. There however may be greater power in allowing an algorithm to learn its own features, termed representation learning. Deep learning is an overarching term given to representation learning, where multiple levels of representation are obtained through the combination of a number of stacked (hence deep) non-linear model layers. Deep learning models typically describe convolutional neural networks (CNN), deep neural networks (DNN), and combined fully-connected deep convolutional neural networks (DCNN) architectures. Other architectures include recurrent neural networks (RNN), such as Long Short Term Memory (LSTM) networks, which are especially adept at modelling sequential time-series data. \n",
    "\n",
    "In the next lectures and lab we will look at how we can use RNNs to model this time-series based mortality prediction task. \n",
    "\n",
    "<em>Aside</em>: Another task would be to look at the spectral properties of the signal and generate new features. You will learn more about data transformations in Wednesday's lecture."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "important-netherlands",
   "metadata": {},
   "source": [
    "### References\n",
    "1. T. Hastie, R. Tibshirani, and J. Friedman,The elements of statistical learning: data mining,inference, and prediction.  Springer Science & Business Media, 2009.\n",
    "1. C. R. Rao, Linear statistical inference and its applications.  Wiley New York, 1973, vol. 2.\n",
    "1. P. McCullagh,Generalized linear models.  Routledge, 2018.\n",
    "1. D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant,Applied logistic regression.  John Wiley &Sons, 2013, vol. 398.\n",
    "1. T. Hastie, R. Tibshirani, and M. Wainwright,Statistical learning with sparsity: the lasso and generalizations.  CRC press, 2015.\n",
    "1. R. Tibshirani, “Regression shrinkage and selection via the lasso,”Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.\n",
    "1. L. Breiman, “Random forests,”Machine learning, vol. 45, no. 1, pp. 5–32, 2001.\n",
    "1. C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, “Conditional variable importancefor random forests,”BMC bioinformatics, vol. 9, no. 1, p. 307, 2008.\n",
    "1. Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: forecasting and control. John Wiley & Sons\n",
    "1. Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge University Press."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "d84c229c30796be5029848b176402793951a8e3372b39468a48ffbc5d33b3318"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
